Summary
- The goal of this project is to use machine learning to predict Ag
archiving capacity in LECs from naive mice.
- For a proof-of-principle analysis, Ag-tracking data for d14 cLECs
were used to train a random forest classifier to predict Ag status.
- Using this model we defined a gene program that correlates with Ag
status at various timepoints
- Archiving-competent cLECs can be predicted in the CHIKV LN
scRNA-seq data
- There is a reduction in archiving-competent cLECs in CHIKV-infected
mice and a broad downregulation of the Ag-archiving gene program
- The central goal for this project is to optimize the model
(e.g. expand it to other cell types) and use it to assess archiving
capacity in samples that did not receive an Ag-tag (e.g. other published
datasets). We can then identify perturbations or treatments that are
predicted to impair archiving.
Classifying Ag-high
Ag-low and Ag-high cells were identified by separately clustering each
LEC subset for each sample into two groups based on Ag score. For the
6wk-3wk sample, the 3wk Ag score was used. The Ag-low/high
classifications used for the analysis are shown below.
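
The per-sample, per-subset grouping described above can be sketched as follows. This is a minimal illustration only: the text does not state which clustering method was used, so a one-dimensional two-group k-means is assumed here, and the function names, sample labels, and Ag scores are hypothetical.

```python
from collections import defaultdict

def two_group_kmeans(scores, n_iter=100):
    """Split 1D scores into two groups with k-means (k=2, assumed method)."""
    lo, hi = min(scores), max(scores)  # initialize the centers at the extremes
    for _ in range(n_iter):
        low = [s for s in scores if abs(s - lo) <= abs(s - hi)]
        high = [s for s in scores if abs(s - lo) > abs(s - hi)]
        new_lo = sum(low) / len(low)
        new_hi = sum(high) / len(high) if high else hi
        if (new_lo, new_hi) == (lo, hi):  # centers stopped moving
            break
        lo, hi = new_lo, new_hi
    return ["Ag-low" if abs(s - lo) <= abs(s - hi) else "Ag-high" for s in scores]

def classify_ag(cells):
    """Cluster each (sample, LEC subset) group separately."""
    groups = defaultdict(list)
    for i, (sample, subset, score) in enumerate(cells):
        groups[(sample, subset)].append((i, score))
    labels = [None] * len(cells)
    for members in groups.values():
        group_labels = two_group_kmeans([score for _, score in members])
        for (i, _), label in zip(members, group_labels):
            labels[i] = label
    return labels

# Hypothetical cells: (sample, LEC subset, Ag score)
cells = [
    ("d14", "cLEC", 0.1), ("d14", "cLEC", 0.3),
    ("d14", "cLEC", 4.8), ("d14", "cLEC", 5.2),
    ("d14", "fLEC", 0.0), ("d14", "fLEC", 0.2),
    ("d14", "fLEC", 1.9), ("d14", "fLEC", 2.1),
]
labels = classify_ag(cells)
```

Because each subset is clustered on its own, the cutoff between Ag-low and Ag-high adapts to the Ag signal range of that subset. In this toy data a score of 1.9 is Ag-high for fLECs even though it would fall in the Ag-low group for cLECs.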

A random forest classifier was trained using data for d14 cLECs. The
model was then used to predict Ag-high cells in the other Ag
datasets.
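
To make the train-then-predict workflow concrete, here is a toy random forest written with the standard library only. It is a sketch under strong assumptions, not the model used for the analysis: each tree is a single-split decision stump, the two "genes" and all expression values are hypothetical, and a real analysis would use a full library implementation with deeper trees and tuned hyperparameters.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(X, y, features):
    """Find the (feature, threshold) split with the lowest weighted Gini impurity."""
    best = None  # (impurity, feature, threshold, left_label, right_label)
    for f in features:
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [lab for row, lab in zip(X, y) if row[f] <= thr]
            right = [lab for row, lab in zip(X, y) if row[f] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, thr, majority(left), majority(right))
    if best is None:  # no split possible: every value is identical
        label = majority(y)
        return (features[0], float("inf"), label, label)
    return best[1:]

def fit_forest(X, y, n_trees=25, n_features=1, seed=0):
    """Train stumps on bootstrap samples of cells and random gene subsets."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]          # bootstrap sample
        feats = rng.sample(range(len(X[0])), n_features)  # random gene subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict(forest, row):
    """Majority vote across all trees."""
    votes = [left if row[f] <= thr else right for f, thr, left, right in forest]
    return majority(votes)

# Hypothetical training data: rows are cells, columns are two genes.
X_train = [[0.1, 0.3], [0.4, 0.2], [0.2, 0.5], [0.6, 0.1], [0.3, 0.4],
           [5.2, 5.5], [5.6, 5.1], [5.1, 5.8], [5.9, 5.3], [5.4, 5.6]]
y_train = ["Ag-low"] * 5 + ["Ag-high"] * 5
forest = fit_forest(X_train, y_train)

# Cells from another (hypothetical) dataset, scored with the trained model.
X_new = [[0.3, 0.2], [5.5, 5.4]]
predictions = [predict(forest, row) for row in X_new]
```

The key point is that the forest is fit once on the labeled training cells and then applied unchanged to cells from other samples, which is what allows Ag status to be predicted where no Ag label is available.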
The fraction of cells belonging to each predicted Ag group is shown
on the left for cLECs from each sample. The fraction of true Ag-low,
true Ag-high, and false-positive Ag-high cells (high-pred) is shown on
the right.
- The model is fairly accurate in predicting Ag-high cells in the
training and test data (d14 cLECs), but does not perform as well when
predicting Ag-low cells. This can be improved with more
optimization.
- Since we want to identify gene signatures that are expressed in
naive mice and continue to be expressed after Ag levels have fallen, we
expect to observe an increasing fraction of false positive Ag-high cells
for the later timepoints.
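
The three groups used in the plots (true Ag-low, true Ag-high, and high-pred) can be derived from the true and predicted labels roughly as below. This sketch assumes that any cell with a true Ag-high label is shown as Ag-high regardless of its prediction, which the text does not state explicitly. The function names and labels are hypothetical.

```python
from collections import Counter

def prediction_group(true_label, predicted_label):
    """Assign a cell to one of the three plotted groups."""
    if true_label == "Ag-high":
        return "Ag-high"    # assumption: true Ag-high cells keep their label
    if predicted_label == "Ag-high":
        return "high-pred"  # false positive: Ag-low cell predicted to be Ag-high
    return "Ag-low"

def group_fractions(true_labels, predicted_labels):
    """Fraction of cells in each group for one sample."""
    counts = Counter(
        prediction_group(t, p) for t, p in zip(true_labels, predicted_labels)
    )
    total = sum(counts.values())
    return {g: counts[g] / total for g in ("Ag-low", "high-pred", "Ag-high")}

# Hypothetical labels for eight cLECs from one sample
true_labels = ["Ag-low", "Ag-low", "Ag-low", "Ag-high",
               "Ag-high", "Ag-low", "Ag-low", "Ag-high"]
predicted_labels = ["Ag-low", "Ag-high", "Ag-low", "Ag-high",
                    "Ag-low", "Ag-high", "Ag-low", "Ag-high"]
fractions = group_fractions(true_labels, predicted_labels)
```

Tracking the high-pred fraction per sample is what would reveal the expected increase in false-positive Ag-high cells at later timepoints.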

Model accuracy was assessed for each LEC subset. F1 scores are shown
for different combinations of testing and training data.
- Ag-high cLECs, collecting LECs, and fLECs are easiest to predict. All
models show high F1 scores when tested using these LEC subsets.
- Ag-high Ptx3 LECs and BECs are most difficult to accurately predict.
This is expected since these cell populations have the lowest Ag signal
and the fewest Ag-high cells.
- The F1 score is not always highest when the models are tested using
the training cell type. This is not necessarily surprising since the
models were selected using several metrics in addition to the F1
score.
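
For reference, the F1 score for the Ag-high class is the harmonic mean of precision and recall, which simplifies to 2TP / (2TP + FP + FN). The sketch below computes it for a small grid of training and testing combinations. The labels and predictions are hypothetical placeholders, not results from the analysis.

```python
def f1_score(true_labels, predicted_labels, positive="Ag-high"):
    """F1 score for the positive class: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    return 2 * tp / (2 * tp + fp + fn)

H, L = "Ag-high", "Ag-low"

# Hypothetical (training subset, testing subset) -> (true labels, predictions)
results = {
    ("cLEC", "cLEC"): ([H, H, H, H, L, L, L, L], [H, H, H, L, H, L, L, L]),
    ("cLEC", "Ptx3 LEC"): ([H, H, L, L, L, L], [H, L, H, L, L, L]),
}
f1_table = {combo: f1_score(t, p) for combo, (t, p) in results.items()}
```

Because true negatives do not enter the formula, a subset with very few Ag-high cells can score poorly even when most of its cells are classified correctly. A single missed or spurious Ag-high call has a large effect.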

Ag modules
Expression of the top upregulated (top) and downregulated (bottom)
gene modules that are most predictive of Ag signal is shown below.
- There is a notable correlation between the expression of these genes
and the Ag class
- False-positive Ag-high cells (high-pred) show an intermediate level
of expression that falls roughly between the true Ag-low and true
Ag-high cells.
- The false-positive Ag-high cells are potentially cells that are
archiving-competent but have lost or released most of their Ag by the
later timepoints.
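
One simple way to compare module expression between the three groups is sketched below. It assumes the module score is the mean expression of the module genes in each cell, which may differ from the scoring method used for the plots. The gene names and expression values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def module_score(expression, module_genes):
    """Mean expression of the module genes in one cell (assumed scoring)."""
    return mean(expression[g] for g in module_genes)

def mean_score_by_group(cells, module_genes):
    """Average module score for each Ag group."""
    scores = defaultdict(list)
    for cell in cells:
        scores[cell["group"]].append(module_score(cell["expr"], module_genes))
    return {group: mean(values) for group, values in scores.items()}

# Hypothetical upregulated module and per-cell expression values
module_genes = ["GeneA", "GeneB"]
cells = [
    {"group": "Ag-low", "expr": {"GeneA": 0.0, "GeneB": 1.0, "GeneC": 5.0}},
    {"group": "Ag-low", "expr": {"GeneA": 1.0, "GeneB": 2.0, "GeneC": 4.0}},
    {"group": "high-pred", "expr": {"GeneA": 2.0, "GeneB": 2.0, "GeneC": 3.0}},
    {"group": "high-pred", "expr": {"GeneA": 3.0, "GeneB": 3.0, "GeneC": 2.0}},
    {"group": "Ag-high", "expr": {"GeneA": 4.0, "GeneB": 4.0, "GeneC": 1.0}},
    {"group": "Ag-high", "expr": {"GeneA": 5.0, "GeneB": 5.0, "GeneC": 0.0}},
]
group_means = mean_score_by_group(cells, module_genes)
```

In this toy data the high-pred group sits between the true Ag-low and true Ag-high groups, mirroring the intermediate expression pattern described above.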

UMAP projections show Ag-high module expression (top), true Ag-low vs
true Ag-high (middle), and false-positive Ag-high (high-pred) vs true
Ag-high (bottom).
- False-positive Ag-high cells (high-pred) show strong overlap with
true Ag-high cells.
cLEC

Collecting
